Birnbaum's three-parameter logistic item response
model was used to study guessing behavior of low ability examinees on
the Graduate Record Examinations (GRE) General Test, Verbal Measure.
GRE scoring procedures had recently changed from a scoring formula
that corrected for guessing to number-right scoring. The
three-parameter theory was used to assess (1) the effect of this
scoring change on the probability of a correct response; (2)
differences in the probability of a correct response for each of the
four item types (analogies, antonyms, sentence completion, and
reading comprehension); and (3) prediction of guessing according to
differences in probabilities of correct response. The LOGIST computer
program was used to estimate item, person, and c-parameters. Analysis
of variance indicated that differences attributable to scoring
instructions were small and not significant. For three of the four
item types, the mean c-parameter was 15 to 20 percent lower than what
would have occurred from random guessing. For the antonym item type,
however, the mean c was equal to the probability expected from random
guessing. Although some issues were raised suggesting further
research needs, it was concluded that the item response theory
c-parameter was suitable for studying guessing. (GDC)

Presented April 3, 1983 at the Annual Meeting of the National Council on
Measurement in Education as part of the symposium, "Dynamics of Guessing
Behavior: Methodological Approaches."

The programming assistance of Louann Benton is gratefully acknowledged. The
opinions expressed in this paper are solely those of the author.



INTRODUCTION 

When an examinee taking an ability or achievement test is faced with an
item for which he or she is not sure as to the correct answer, a complex
decision making process might occur. Assuming that the examinee wants to
obtain as high a score as possible, given the scoring instructions for the
test, when faced with an item for which the correct response is unclear, the
examinee can determine and follow some strategy to maximize her or his score.
This strategy will be affected by partial information and misinformation that
the examinee may have. Finally, examinees typically are not purely rational
decision theorists. Various personality traits affect an examinee's behavior
in the face of uncertainty.



PURPOSE

In October 1981 the GRE General Test (called the Aptitude Test until
October 1982) switched from using formula-scoring instructions to
right-scoring instructions. In order to explore the use of item response
theory to study examinee guessing behavior, this paper addresses several
questions: Did this change affect the probability of responding correctly to
an item for very low ability examinees? Also, are there any consistent
differences in the probability of a correct response for very low ability
examinees for the four GRE verbal measure item types? Finally, can any
hypotheses about examinee guessing behavior be generated from observed
differences in probabilities of correct responses of low ability examinees?



THE THREE-PARAMETER ITEM RESPONSE MODEL 

The three-parameter logistic model (Birnbaum, 1968) assumes that for an
examinee of given ability, theta, three statistical aspects of an item
determine the probability that the examinee will respond correctly: a, the
discriminating power of the item; b, the difficulty of the item; and c, the
lower asymptote of the item response function. The c-parameter represents the
probability that an extremely low ability examinee (theta approaching negative
infinity) will get the item correct. The c-parameter has been referred to as
a guessing parameter, but since for most multiple-choice items its value is
less than the chance probability of a correct response (that is, 1/A, where A
is the number of response options), which is what would occur with random
guessing, it is more frequently referred to as a pseudo-guessing parameter, or
simply as a lower asymptote parameter.
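
The roles of a, b, and c described above can be illustrated with a short
sketch of the item response function. The scaling constant D = 1.7 is the
conventional choice for the logistic model and is an assumption here, since
the paper does not state it; the item parameters in the example are
hypothetical and are not estimates from this study.

```python
import math

# Conventional logistic scaling constant (assumed; not stated in the paper).
D = 1.7


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing parameter).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


if __name__ == "__main__":
    # Hypothetical five-option item with c below the chance level of 1/5.
    for theta in (-4.0, -2.0, 0.0, 2.0, 4.0):
        print(theta, round(p_correct(theta, a=1.0, b=0.0, c=0.17), 3))
```

As theta decreases, the probability approaches c rather than zero, which is
why estimates of c carry information about the behavior of very low ability
examinees.
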

TEST EDITIONS AND SAMPLES



The GRE General Test verbal measure consists of four item types:
analogies, antonyms, sentence completion, and reading comprehension.
Description and examples of these item types can be found in any edition of
the GRE Information Bulletin (e.g., ETS, 1984). The verbal measure, as did the
other General Test measures, underwent several changes as of October 1981.
Foremost, the scoring instructions changed from formula (rights minus
one-quarter wrongs) to number-right. Thus, it was more clearly in the
examinees' interests to guess when they were unsure of the answer to an item
for the post-October 1981 verbal measure.

Table 1 presents, for each test edition, the administration date, the number
of analogy, antonym, sentence completion, and reading comprehension items, the
overall difficulty of the verbal measure (mean equated delta), the mean scaled
score of the sample, and the sample size. All items have five response
options.



Table 1

Test      Admin.       Number of Items         Mean     Mean    Sample
Edition   Date     Anal.  Ant.  S.C.  R.C.     Delta    Score   Size

FS-A      12/79     18     20    17    25      11.8     498     4,574
FS-B       2/80     18     20    17    25      11.8     472     4,475
FS-C       4/80     18     20    17    25      11.8     472     4,835
FS-D       6/80     18     22    13    22      11.8     473     2,984
RS-E      10/81     17     20    13    22      12.0     495     4,408
RS-F      12/81     18     22    14    22      11.8     496     4,096
RS-G       2/82     18     22    14    22      11.9     482     3,746
RS-H       4/82     18     22    14    22      11.9     465     3,647
RS-I      10/82     18     22    14    22      11.8     500     4,331
RS-J       4/83     18     22    14    22      11.9     485     3,825

DATA ANALYSIS



The program LOGIST was used to estimate item and person parameters
based on the three-parameter logistic model for four editions of the GRE
General Test administered between December 1979 and June 1980 under
formula-scoring instructions, and for six editions administered between
October 1981 and April 1983 under right-scoring instructions. Sample sizes
ranged from about 2,900 to 4,800. This paper presents comparisons of estimated
c-parameters for the four GRE verbal item types (analogies, antonyms,
sentence completion, and reading comprehension) administered under formula-
and right-scoring instructions.

It should be noted that there were two important differences in the
estimation procedures for the c-parameters that might have influenced the
results of this study. Both relate to the procedures LOGIST uses to estimate
the c-parameter for items that have insufficient data at the lower asymptote.
LOGIST allows the user to decide how much data is "enough" for estimating c.
At the time the parameters were estimated for the formula-scored tests, the
author was conservative and there were many items for which it was decided
there was insufficient data to estimate a unique c. After obtaining more
experience with both LOGIST and GRE data, when LOGIST was used to estimate
parameters for the right-scored tests, a less conservative approach was used
and a unique c was estimated for a considerably larger proportion of the
items. This is reflected in the difference between the standard deviations
for the two scoring instruction conditions (see Table 2). Perhaps more
critically, the procedure for estimating the "common c" for those items for
which there were not sufficient data differed. The version of LOGIST used to
estimate parameters for the formula-scored tests used the median of the
c-parameter estimates of those items for which there were unique estimates.
The more recent version of LOGIST used to estimate parameters for the
right-scored tests estimated the common c using modified maximum likelihood
methods based on combined data for all such items (Wingersky, 1983).
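
The older "common c" rule described above (the median of the uniquely
estimated c values) can be sketched as follows. This is only an illustration
of that one rule, not LOGIST code; the function name, the use of None to mark
items with insufficient data, and the example values are all hypothetical.
The modified maximum likelihood procedure of the more recent version is not
sketched because the paper does not describe it in detail.

```python
from statistics import median


def fill_common_c(c_estimates):
    """Assign a common c to items lacking a unique estimate.

    c_estimates: list of floats, with None for items judged to have
    insufficient data at the lower asymptote (hypothetical convention).
    Returns (filled_estimates, common_c).
    """
    unique = [c for c in c_estimates if c is not None]
    if not unique:
        raise ValueError("no uniquely estimated c-parameters available")
    # Median of the unique estimates, as in the older procedure described above.
    common_c = median(unique)
    filled = [common_c if c is None else c for c in c_estimates]
    return filled, common_c


if __name__ == "__main__":
    # Hypothetical values, not estimates from this study.
    filled, common_c = fill_common_c([0.12, None, 0.20, 0.17, None])
    print(common_c, filled)
```

Because every item without a unique estimate receives the same value, a
conservative threshold for "enough" data reduces the spread of the c
estimates, which is consistent with the smaller standard deviations reported
for the formula-scored editions.
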

To determine the effect of the change from formula-scoring to
right-scoring instructions for the four GRE verbal item types, a two-way,
unweighted means analysis of variance was performed on the estimated
c-parameters (Winer, 1971, chapter 5.22).
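
For readers unfamiliar with the unweighted means approach, the following is a
minimal sketch of the usual computation for a two-way design with unequal
cell sizes: effects are computed from the cell means, with the harmonic mean
of the cell sizes standing in for the common n of a balanced design. It is
offered under the assumption that this textbook form matches the analysis
used here; the function name and data layout are hypothetical, and p-values
are omitted because they would require an F distribution routine outside the
standard library.

```python
from statistics import mean


def unweighted_means_anova(cells):
    """Two-way unweighted-means ANOVA.

    cells[i][j] holds the observations for level i of factor A and
    level j of factor B; every cell must be non-empty.
    """
    p, q = len(cells), len(cells[0])
    flat = [cell for row in cells for cell in row]
    n_total = sum(len(cell) for cell in flat)
    # Harmonic mean of the cell sizes replaces the common n of a balanced design.
    n_h = (p * q) / sum(1.0 / len(cell) for cell in flat)

    m = [[mean(cell) for cell in row] for row in cells]            # cell means
    a_means = [mean(row) for row in m]                              # unweighted A marginals
    b_means = [mean(m[i][j] for i in range(p)) for j in range(q)]   # unweighted B marginals
    grand = mean(a_means)

    ss = {
        "A": n_h * q * sum((a - grand) ** 2 for a in a_means),
        "B": n_h * p * sum((b - grand) ** 2 for b in b_means),
        "AxB": n_h * sum((m[i][j] - a_means[i] - b_means[j] + grand) ** 2
                         for i in range(p) for j in range(q)),
        # Error is the pooled within-cell sum of squares.
        "error": sum((x - m[i][j]) ** 2
                     for i in range(p) for j in range(q) for x in cells[i][j]),
    }
    df = {"A": p - 1, "B": q - 1, "AxB": (p - 1) * (q - 1),
          "error": n_total - p * q}
    ms = {k: ss[k] / df[k] for k in ss}
    f = {k: ms[k] / ms["error"] for k in ("A", "B", "AxB")}
    return {"ss": ss, "df": df, "ms": ms, "f": f}
```

In this study factor A would be scoring instructions (two levels) and factor
B item type (four levels), with each observation being one item's estimated
c. The 767 items listed in Table 1, spread over 2 x 4 = 8 cells, give
767 - 8 = 759 error degrees of freedom.
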

RESULTS 

Table 2 presents the standard deviations for each cell in the ANOVA.
Although the standard deviations show clearly that the assumption of
homogeneous variances is violated, it has been shown that ANOVA is robust to
violations much more severe than this (Box, 1954).

Table 3 presents the mean of the estimated c-parameters for each item
type and scoring instruction condition. Table 4 presents the analysis of
variance. The differences attributable to scoring instructions are very small
and are not statistically significant at any commonly accepted level. The
differences among the four item types are statistically significant at
considerably beyond the .0001 level. The mean c for antonyms is higher than
that for the three other verbal item types. Although the interaction is not
statistically significant at any commonly accepted level, it is interesting to
note that while for analogy items the mean c was .02 higher under
formula-scoring instructions than under right-scoring instructions, for
reading comprehension items the mean c was .01 lower under formula-scoring
instructions.



Table 2

Standard Deviations of c-Parameter Estimates

Scoring                              Sentence      Reading
Method     Analogies    Antonyms     Completion    Comprehension

Formula       .04          .06          .05            .04
Right         .08          .08          .09            .08



Table 3

Means of c-Parameter Estimates

Scoring                              Sentence      Reading
Method     Analogies    Antonyms     Completion    Comprehension    Marginal

Formula       .17          .20          .17            .16             .18
Right         .15          .20          .16            .17             .17
Marginal      .16          .20          .17            .16             .17



Table 4

Analysis of Variance

Source of
Variation                df       SS        MS        F        p

Item Type                 3     .3421     .1140     23.23    <.01
Scoring Instructions      1     .0051     .0051      1.17     .28
Interaction               3     .0263     .0088      1.78     .15
Error                   759    3.7252     .0049







Note: Marginals are based on unweighted cell means.



DISCUSSION 



It has often been noted that for most items c is less than 1/A, the
probability that would be expected if a group of examinees guessed at random.
Indeed, for three of the four verbal item types the mean c was 15 to 20
percent lower than the .20 that would have occurred from random guessing (that
is, the mean c was .17 or .16). For the antonym item type, however, the mean c
was equal to 1/A. As is not unusual for an exploratory study, more questions
were created than were answered.

1. It has been hypothesized (Lord, 1980) that the finding that c
tends to be less than 1/A indicates that many very low ability
examinees do not guess at random, and tend to be misled by
plausible distractors. Does the finding that this is not
affected by scoring instructions indicate that this is a
function of the same major dimension underlying test scores,
or might there still be some other dimension(s), perhaps
personality traits, that partially explain this phenomenon?

2. Do very low ability examinees guess at random for antonym
items, but are they misled into picking plausible distractors
more frequently than would be accounted for by chance for the
other three item types? It has been hypothesized (Petersen,
personal communication) that if an examinee does not recognize
the stem word, he or she will not be able to make use of
either partial information or misinformation that exists in the
distractors. For analogies, sentence completion, and reading
comprehension items, however, numerous pieces of information
are available in both the stem and the distractors.

3. What is it about antonym items that makes guessing behavior 
for them so different than for sentence completion items, even 
though for the GRE population, scores for the two item types 
correlate almost perfectly when corrected for unreliability 
(for example, for edition RS-H the uncorrected correlation 
between raw scores on antonyms and sentence completion was .71 
and the correlation corrected for unreliability was .98)? 
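
The correction for unreliability mentioned in the third question is
presumably the standard correction for attenuation, in which the observed
correlation is divided by the square root of the product of the two
reliabilities. The sketch below assumes that formula; the paper does not
report the reliabilities themselves, so the example only back-calculates the
product of reliabilities implied by the two reported correlations. Function
names are hypothetical.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Correlation corrected for unreliability (correction for attenuation)."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


def implied_reliability_product(r_observed, r_corrected):
    """Product of the two reliabilities implied by a pair of correlations."""
    return (r_observed / r_corrected) ** 2


if __name__ == "__main__":
    # Values reported for edition RS-H: .71 uncorrected, .98 corrected.
    product = implied_reliability_product(0.71, 0.98)
    print(round(product, 3))             # implied product of reliabilities
    print(round(math.sqrt(product), 3))  # implied geometric-mean reliability
```

Under this assumption, the reported pair of correlations implies that the two
item-type scores have a geometric-mean reliability of roughly .72, which is
plausible for subscores based on about 14 to 22 items.
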

CONCLUSIONS AND RECOMMENDATIONS

Once again, more research is necessary. Clearly, a stronger research
design would be useful, especially since analyses of c-parameters are
essentially based on only a small portion of one's samples. Using the
identical items for the two scoring conditions would have provided a more
powerful analysis. Using the more recent version of LOGIST for both parts of
the analysis would also have strengthened this research. But, I believe that
this paper has done what I set out to do: demonstrated that the IRT
c-parameter has potential for shedding light on examinee guessing behavior.